DATA 5600 - Introduction to Linear Regression and Machine Learning
Group Data Analysis 2 - Multiple Linear Regression
Dallin Moore & Vincent Rohrer
Introduction¶
The data we are examining records demand for bikes through a bike rental service in Seoul, South Korea. Our goal is to develop a model that accurately predicts this demand from the variables provided in the data, including the effects of day of the week, hour, and weather. Assuming the model is accurate, we will then move on to practical applications for predicting demand, such as scheduling maintenance during low-demand periods and implementing dynamic pricing strategies.
Citation: Seoul Bike Sharing Demand [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5F62R.
Load in Libraries¶
%pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
import textwrap as tw
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import ElasticNet
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from statsmodels.graphics.factorplots import interaction_plot
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.anova import anova_lm
from itertools import combinations
# suppress warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
Examine the Dataset¶
# fetch dataset
full_bike_df = fetch_ucirepo(id=560)
# metadata
print("Summary:",tw.fill(full_bike_df.metadata['additional_info']['summary'], width=80))
print()
print(full_bike_df.metadata['additional_info']['variable_info'])
Summary: Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes. The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information. Date : year-month-day Rented Bike count - Count of bikes rented at each hour Hour - Hour of he day Temperature-Temperature in Celsius Humidity - % Windspeed - m/s Visibility - 10m Dew point temperature - Celsius Solar radiation - MJ/m2 Rainfall - mm Snowfall - cm Seasons - Winter, Spring, Summer, Autumn Holiday - Holiday/No holiday Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)
To prepare the dataset:
- Shrink the DataFrame down to 5,000 observations.
- Given that Functioning Day would not be available when making predictions, it is dropped from the dataset. Observations when the service was not functioning are also removed. (Note: demand is also 0 when Functioning Day is No.)
- Seasons is dropped because it is expected to be highly collinear with the weather variables.
# Create a DataFrame from the features/targets and make it smaller
bike_df = pd.DataFrame(full_bike_df.data.features)
bike_df[str(full_bike_df.data.targets.columns[0])] = full_bike_df.data.targets
bike_df = bike_df[bike_df['Functioning Day'] == 'Yes']
bike_df = bike_df.drop('Functioning Day', axis=1)
bike_df = bike_df.drop('Seasons', axis=1)
bike_df = bike_df.sample(n=5000, random_state=1) # random_state for reproducibility
bike_df.columns = bike_df.columns.str.replace(' ', '_') # replace spaces with underscores
bike_df.head()
| Date | Rented_Bike_Count | Hour | Temperature | Humidity | Wind_speed | Visibility | Dew_point_temperature | Solar_Radiation | Rainfall | Snowfall | Holiday | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2634 | 20/03/2018 | 1111 | 18 | 4.1 | 43 | 3.5 | 2000 | -7.3 | 0.21 | 0.0 | 0.0 | No Holiday |
| 4627 | 11/6/2018 | 873 | 19 | 19.4 | 85 | 1.2 | 1651 | 16.8 | 0.08 | 0.5 | 0.0 | No Holiday |
| 8392 | 15/11/2018 | 1123 | 16 | 15.4 | 34 | 1.4 | 1016 | -0.4 | 0.68 | 0.0 | 0.0 | No Holiday |
| 1768 | 12/2/2018 | 239 | 16 | -2.4 | 33 | 4.5 | 1815 | -16.5 | 1.55 | 0.0 | 0.0 | No Holiday |
| 4138 | 22/05/2018 | 1119 | 10 | 20.7 | 55 | 1.8 | 1698 | 11.3 | 1.43 | 0.0 | 0.0 | Holiday |
Exploratory Data Analysis¶
Continuous Data¶
bike_df.drop('Hour', axis=1).describe()
| Rented_Bike_Count | Temperature | Humidity | Wind_speed | Visibility | Dew_point_temperature | Solar_Radiation | Rainfall | Snowfall | |
|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 734.936000 | 12.870440 | 58.185400 | 1.728060 | 1427.501600 | 4.065960 | 0.568016 | 0.158780 | 0.078080 |
| std | 646.021839 | 12.107273 | 20.389894 | 1.023752 | 611.465364 | 13.268638 | 0.860439 | 1.166411 | 0.466003 |
| min | 2.000000 | -17.800000 | 0.000000 | 0.000000 | 27.000000 | -30.500000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 212.000000 | 3.100000 | 42.750000 | 1.000000 | 924.750000 | -5.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 555.000000 | 13.700000 | 57.000000 | 1.500000 | 1673.000000 | 4.900000 | 0.010000 | 0.000000 | 0.000000 |
| 75% | 1095.000000 | 22.700000 | 74.000000 | 2.300000 | 2000.000000 | 15.300000 | 0.930000 | 0.000000 | 0.000000 |
| max | 3384.000000 | 39.400000 | 98.000000 | 7.400000 | 2000.000000 | 26.600000 | 3.520000 | 35.000000 | 8.800000 |
sns.pairplot(bike_df)
plt.suptitle('Pairplot of Continuous Variables', y=1.02)
plt.show()
Seeing the high linear association between temperature and dew point temperature is an immediate red flag. One of these variables will likely need to be removed to mitigate any multicollinearity; our selection methods will take care of this.
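To put a number on the association the pairplot suggests, the pairwise correlation can be computed directly. A minimal sketch on synthetic stand-in data (the real check would use the `bike_df` columns; the generated values here are illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: dew point tracks temperature closely plus noise
temp = rng.uniform(-15, 35, size=500)
dew_point = temp - 8 + rng.normal(0, 3, size=500)

demo_df = pd.DataFrame({'Temperature': temp,
                        'Dew_point_temperature': dew_point})

# Pearson correlation between the two predictors
r = demo_df['Temperature'].corr(demo_df['Dew_point_temperature'])
```

A correlation this close to 1 is exactly the multicollinearity signal the pairplot raises.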
def show_distribution(column):
column_name = column if len(column)< 13 else column[:13]+'...'
fig, axes = plt.subplots(1, 2, figsize=(6, 2))
# Plot the histogram with density curve on the first axis
sns.histplot(bike_df[column], kde=True, ax=axes[0]) # Use kde=True for a density curve
axes[0].set_title('Distribution of ' + column_name)
axes[0].set_xlabel(column)
axes[0].set_ylabel('Frequency')
# Plot the box plot on the second axis
sns.boxplot(x=bike_df[column], ax=axes[1])
axes[1].set_title('Box Plot of ' + column_name)
axes[1].set_xlabel(column)
plt.tight_layout() # Adjust spacing between plots
plt.show()
# Example usage
show_distribution('Rented_Bike_Count')
show_distribution('Temperature')
show_distribution('Visibility')
show_distribution('Wind_speed')
show_distribution('Humidity')
show_distribution('Dew_point_temperature')
show_distribution('Solar_Radiation')
show_distribution('Rainfall')
show_distribution('Snowfall')
Variables of concern from the start are solar radiation, rainfall, and snowfall: each has a very small interquartile range at very low values, indicating that most observations are near zero and the distributions are highly right-skewed. This also means each of these variables has many outliers.
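The skew described above can be quantified with the third standardized moment. A sketch on a synthetic variable built to mimic the mostly-zero Rainfall/Snowfall pattern (numpy only; illustration, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Mostly zeros with occasional large values, mimicking hourly rainfall
rainfall_like = np.where(rng.uniform(size=1000) < 0.9,
                         0.0,
                         rng.exponential(5.0, size=1000))

# Third standardized moment: positive values indicate right skew
skewness = (np.mean((rainfall_like - rainfall_like.mean()) ** 3)
            / rainfall_like.std() ** 3)

# With ~90% zeros, even the 75th percentile is zero, so the IQR collapses
iqr = np.percentile(rainfall_like, 75) - np.percentile(rainfall_like, 25)
```

A large positive skewness together with a zero IQR matches what the histograms and box plots show.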
Categorical Data¶
The following variables will be transformed into categorical variables:
- Date → Day of the week
- Hour (continuous) → Hour (categorical)
- Rain/Snow (continuous) → Precipitation (binary)
* Snow and Rain amounts are kept in the dataset alongside the new Precipitation column to account for the large share of observations with no precipitation.
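The three transformations listed above can be sketched on a tiny hand-made frame (column names match the dataset; the values are made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'Date': ['20/03/2018', '11/06/2018'],
    'Hour': [18, 3],
    'Rainfall': [0.0, 1.5],
    'Snowfall': [0.4, 0.0],
})

# Date -> day-of-week name (dates in this dataset are day-first)
demo['Day'] = pd.to_datetime(demo['Date'], dayfirst=True).dt.day_name()

# Hour -> categorical, so it is dummy-coded rather than treated as numeric
demo['Hour'] = demo['Hour'].astype('category')

# Rain/Snow -> binary Precipitation flag; raw amounts are kept alongside it
demo['Precipitation'] = ((demo['Rainfall'] > 0) | (demo['Snowfall'] > 0)).astype(int)
```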
plt.figure(figsize=(4, 4))
# Create the boxplot
sns.boxplot(x='Holiday',
y='Rented_Bike_Count',
data=bike_df)
# Calculate value counts
counts = bike_df['Holiday'].value_counts()
# Add counts to the plot
for i, count in enumerate(counts):
plt.text(i, 1, f'Count: {count}', ha='center', va='bottom') # Adjust the y position as needed
plt.title('Rented Bike Count by Holiday')
plt.xlabel('Holiday')
plt.ylabel('Rented Bike Count')
plt.show()
plt.figure(figsize = (6, 4))
sns.boxplot(x = 'Hour',
y = 'Rented_Bike_Count',
data = bike_df)
plt.title('Rented Bike Count by Hour')
plt.xlabel('Hour')
plt.ylabel('Rented Bike Count')
plt.show()
The box plots show clear trends: rented bike counts are low during the early morning and peak at commute times that correspond with the average working day in South Korea (9:00 AM - 6:00 PM).
bike_df['Date'] = pd.to_datetime(bike_df['Date'], dayfirst=True)
bike_df['Day'] = bike_df['Date'].dt.day_name()
plt.figure(figsize = (8, 4))
sns.boxplot(x = 'Day',
y = 'Rented_Bike_Count',
data = bike_df,
order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plt.title('Rented Bike Count by Weekday')
plt.xlabel('Day of the Week')
plt.ylabel('Rented Bike Count')
plt.show()
Though it is not as extreme as the time of day, there are observable trends in the rented bike count for each day of the week. Sunday has the lowest median while Friday has the highest. There is little fluctuation from day to day, but enough to see that weekends appear to experience less demand than weekdays.
bike_df['Precipitation'] = ((bike_df['Rainfall'] > 0) | (bike_df['Snowfall'] > 0)).astype(int)
plt.figure(figsize = (6, 4))
sns.boxplot(x = 'Precipitation',
y = 'Rented_Bike_Count',
data = bike_df)
plt.title('Rented Bike Count by Precipitation')
plt.xlabel('Precipitation')
plt.ylabel('Rented Bike Count')
plt.show()
Precipitation appears to have a significant effect on the rented bike count, as hours with any precipitation show a dramatic decrease in their total rented bike count.
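The comparison the box plot makes can also be read off a groupby. A sketch with made-up counts (the real comparison would group `bike_df` on its `Precipitation` column):

```python
import pandas as pd

# Synthetic counts: three dry hours, three wet hours
demo = pd.DataFrame({
    'Precipitation':     [0,   0,    0,   1,   1,   1],
    'Rented_Bike_Count': [900, 700, 1100, 150, 90, 200],
})

# Median demand with and without precipitation
medians = demo.groupby('Precipitation')['Rented_Bike_Count'].median()
```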
Variable Selection¶
Dummy Code Categorical Variables¶
Create a new DataFrame that contains all of the variables, including the dummy coded variables, for the analysis.
# dummy code categorical variables
bike_df_dummy = pd.get_dummies(bike_df, columns=['Day', 'Hour'], drop_first=False, dtype=int)
bike_df_dummy.drop(columns = ['Day_Sunday'], inplace = True) # use Sunday as the baseline since it has the lowest average rented bike count
bike_df_dummy.drop(columns = ['Hour_0'], inplace = True)
bike_df_dummy.drop(columns = ['Date'], inplace = True)
bike_df_dummy['Holiday'] = bike_df_dummy['Holiday'].map({'Holiday': 1.0, 'No Holiday': 0.0})
bike_df_dummy.head()
| Rented_Bike_Count | Temperature | Humidity | Wind_speed | Visibility | Dew_point_temperature | Solar_Radiation | Rainfall | Snowfall | Holiday | ... | Hour_14 | Hour_15 | Hour_16 | Hour_17 | Hour_18 | Hour_19 | Hour_20 | Hour_21 | Hour_22 | Hour_23 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2634 | 1111 | 4.1 | 43 | 3.5 | 2000 | -7.3 | 0.21 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4627 | 873 | 19.4 | 85 | 1.2 | 1651 | 16.8 | 0.08 | 0.5 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8392 | 1123 | 15.4 | 34 | 1.4 | 1016 | -0.4 | 0.68 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1768 | 239 | -2.4 | 33 | 4.5 | 1815 | -16.5 | 1.55 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4138 | 1119 | 20.7 | 55 | 1.8 | 1698 | 11.3 | 1.43 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 40 columns
# Split the data into response variable y and predictor variables X
y = bike_df_dummy['Rented_Bike_Count']
X = bike_df_dummy.drop('Rented_Bike_Count', axis = 1)
# Print the shapes to verify
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
Shape of X: (5000, 39) Shape of y: (5000,)
Correlation Matrix¶
Before seeing what features the selection methods suggest removing, a correlation matrix will be helpful for identifying variables that are highly correlated with each other and should probably be removed.
correlation_matrix = bike_df_dummy.corr()
rounded_correlation_matrix = correlation_matrix.round(2)
# Create a custom mask where correlations less than ±0.5 are not shown
mask = np.abs(rounded_correlation_matrix) < 0.5
# Create the heatmap with only significant correlations displayed
plt.figure(figsize=(12, 9))
sns.heatmap(correlation_matrix,
            cmap='coolwarm',                                           # color palette
            annot=rounded_correlation_matrix.where(~mask).fillna(''),  # only show values beyond +/-0.5
            fmt="",                                                    # annotations are pre-formatted strings
            vmin=-1,                                                   # start color legend at -1
            vmax=1,                                                    # end color legend at 1
            mask=np.triu(correlation_matrix),                          # remove the upper half of the plot
            annot_kws={"size": 6})                                     # shrink annotation text
plt.title("Correlation Heatmap of Selected Variables (values beyond +/-0.5 shown)")
plt.show()
Either Dew point temperature or Temperature will need to be removed given their correlation of 0.92. Humidity and Visibility also share a high correlation worth keeping in mind in the upcoming steps.
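Rather than reading the heatmap by eye, pairs above a threshold can be listed programmatically. A sketch on a small synthetic frame with one engineered collinear pair (column names mirror the dataset; the real check would start from `bike_df_dummy.corr()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# One collinear pair by construction, one independent column
x = rng.normal(size=300)
demo = pd.DataFrame({
    'Temperature': x,
    'Dew_point_temperature': x + rng.normal(0, 0.3, size=300),
    'Humidity': rng.normal(size=300),
})

# Flag every predictor pair whose |correlation| exceeds the threshold
corr = demo.corr()
threshold = 0.5
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```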
Selection Methods¶
Sequential Replacement and Backward Stepwise Selection will be used to determine which variables should be removed (Best Subsets is not computationally feasible with this many predictors).
Sequential Replacement¶
seqrep_selection = SFS(LinearRegression(fit_intercept = True),
k_features = (1, len(X.columns)),
forward = True,
floating = True,
scoring = 'neg_mean_squared_error',
cv = 5)
seqrep = seqrep_selection.fit(X, y)
print('Sequential Replacement Stepwise Selection:', seqrep.k_feature_names_)
Sequential Replacement Stepwise Selection: ('Temperature', 'Humidity', 'Wind_speed', 'Visibility', 'Solar_Radiation', 'Rainfall', 'Snowfall', 'Holiday', 'Precipitation', 'Day_Friday', 'Day_Monday', 'Day_Saturday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday', 'Hour_1', 'Hour_2', 'Hour_3', 'Hour_4', 'Hour_5', 'Hour_6', 'Hour_7', 'Hour_8', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23')
# Code to see the next best models
seqrep_results = pd.DataFrame.from_dict(seqrep.get_metric_dict()).T
seqrep_results['missing_features'] = seqrep_results['feature_names'].apply(lambda x: [col for col in X.columns if col not in x])
seqrep_results_sorted = seqrep_results.sort_values(by = 'avg_score', ascending = False)
seqrep_results_sorted[['missing_features','std_dev']].head(15)
| missing_features | std_dev | |
|---|---|---|
| 37 | [Dew_point_temperature, Hour_9] | 5230.304404 |
| 38 | [Dew_point_temperature] | 5344.897024 |
| 35 | [Dew_point_temperature, Hour_12, Hour_13, Hour... | 5493.476962 |
| 34 | [Dew_point_temperature, Hour_1, Hour_12, Hour_... | 5520.913622 |
| 36 | [Dew_point_temperature, Hour_12, Hour_15] | 5376.756829 |
| 33 | [Dew_point_temperature, Hour_1, Hour_10, Hour_... | 5573.969762 |
| 32 | [Wind_speed, Dew_point_temperature, Hour_1, Ho... | 5857.374961 |
| 39 | [] | 5432.826142 |
| 31 | [Wind_speed, Dew_point_temperature, Day_Saturd... | 6054.465006 |
| 30 | [Wind_speed, Dew_point_temperature, Day_Saturd... | 5742.809293 |
| 29 | [Wind_speed, Dew_point_temperature, Day_Monday... | 6001.001693 |
| 28 | [Wind_speed, Dew_point_temperature, Day_Monday... | 6104.228572 |
| 27 | [Wind_speed, Dew_point_temperature, Day_Monday... | 6089.6663 |
| 26 | [Wind_speed, Dew_point_temperature, Day_Monday... | 6193.715188 |
| 25 | [Wind_speed, Visibility, Dew_point_temperature... | 5867.066821 |
Backward Stepwise Selection¶
backward_selection = SFS(LinearRegression(fit_intercept = True),
k_features = (1, len(X.columns)),
forward = False,
floating = False,
scoring = 'neg_mean_squared_error',
cv = 5)
backward = backward_selection.fit(X, y)
print('Backward Stepwise Selection:', backward.k_feature_names_)
Backward Stepwise Selection: ('Temperature', 'Humidity', 'Wind_speed', 'Visibility', 'Solar_Radiation', 'Rainfall', 'Snowfall', 'Holiday', 'Precipitation', 'Day_Friday', 'Day_Monday', 'Day_Saturday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday', 'Hour_1', 'Hour_2', 'Hour_3', 'Hour_4', 'Hour_5', 'Hour_6', 'Hour_7', 'Hour_8', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23')
backward_results = pd.DataFrame.from_dict(backward.get_metric_dict()).T
backward_results_sorted = backward_results.sort_values(by = 'avg_score', ascending = False)
# add a missing_features column that indicates what features are not present in the feature_names column given the X.column_names_
backward_results_sorted['missing_features'] = backward_results_sorted['feature_names'].apply(lambda x: [col for col in X.columns if col not in x])
backward_results_sorted[['missing_features','std_dev']].head(15)
| missing_features | std_dev | |
|---|---|---|
| 37 | [Dew_point_temperature, Hour_9] | 5230.304404 |
| 38 | [Dew_point_temperature] | 5344.897024 |
| 36 | [Dew_point_temperature, Hour_9, Hour_16] | 5287.908222 |
| 35 | [Wind_speed, Dew_point_temperature, Hour_9, Ho... | 5544.034504 |
| 34 | [Wind_speed, Visibility, Dew_point_temperature... | 5295.360484 |
| 39 | [] | 5432.826142 |
| 33 | [Wind_speed, Visibility, Dew_point_temperature... | 5398.588404 |
| 32 | [Wind_speed, Visibility, Dew_point_temperature... | 5366.285955 |
| 31 | [Wind_speed, Visibility, Dew_point_temperature... | 5607.162315 |
| 30 | [Wind_speed, Visibility, Dew_point_temperature... | 5892.107155 |
| 29 | [Wind_speed, Visibility, Dew_point_temperature... | 5492.028884 |
| 28 | [Wind_speed, Visibility, Dew_point_temperature... | 5536.688482 |
| 27 | [Wind_speed, Visibility, Dew_point_temperature... | 5475.068059 |
| 26 | [Wind_speed, Visibility, Dew_point_temperature... | 5193.849535 |
| 25 | [Wind_speed, Visibility, Dew_point_temperature... | 4908.256181 |
Based on both of the selection methods, the following variables will be removed:
- Dew_point_temperature
- Wind_speed
- Visibility
Shrinkage Method¶
To further determine which variables are unnecessary for our analysis, Elastic Net will also be used for variable selection.
Elastic Net¶
# Create a list of possible alphas
potential_alphas = np.logspace(-4, 2, 500)
# run cross-validation to find the best alpha and l1_ratio
ElasticNetCV_model = ElasticNetCV(alphas = potential_alphas,
cv = 5,
random_state = 12345,
max_iter = 10000,
fit_intercept = True)
ElasticNetCV_model.fit(X, y)
# Get the list of alphas and corresponding MSEs
alphas = ElasticNetCV_model.alphas_
pmse_means = np.mean(ElasticNetCV_model.mse_path_, axis = 1)
pmse_std_error = np.std(ElasticNetCV_model.mse_path_,
axis = 1,
ddof = 1) / np.sqrt(5) # 5-fold CV
# Find the alpha that minimizes MSE
alpha_index_min = np.argmin(pmse_means)
alpha_min = alphas[alpha_index_min]
# Find the MSE that is one standard error away from the minimum MSE
one_se_above_min = min(pmse_means) + pmse_std_error[alpha_index_min]
# Find the largest alpha with MSE less than or equal to one_se_above_min
# (ElasticNetCV stores alphas_ in descending order, so the first match is the largest)
alpha_index_1se = np.where(pmse_means <= one_se_above_min)[0][0]
alpha_1se = alphas[alpha_index_1se]
print("Minimum alpha:", alpha_min)
print("One SE alpha:", alpha_1se)
Minimum alpha: 0.0006945177773823696 One SE alpha: 0.015009032755797366
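The one-SE selection above follows this logic: take the largest penalty whose cross-validated error stays within one standard error of the minimum. A toy illustration with hand-made numbers (alphas listed in descending order, as `ElasticNetCV` stores them):

```python
import numpy as np

# Toy CV curve: per-alpha mean MSE and its standard error
alphas = np.array([1.0, 0.5, 0.1, 0.05, 0.01])
mse    = np.array([12.0, 9.0, 8.0, 7.9, 8.2])
se     = np.array([0.5, 0.4, 0.3, 0.3, 0.4])

i_min  = np.argmin(mse)            # minimum-MSE alpha
cutoff = mse[i_min] + se[i_min]    # one SE above the minimum

# First qualifying index = largest alpha under the cutoff
i_1se = np.where(mse <= cutoff)[0][0]
alpha_1se = alphas[i_1se]
```

The one-SE rule trades a little fit for a sparser, more heavily penalized model.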
plt.plot(alphas,
pmse_means)
plt.fill_between(alphas,
pmse_means + pmse_std_error,
pmse_means - pmse_std_error,
alpha = 0.1)
plt.axhline(one_se_above_min,
linestyle = 'dotted',
label = 'Best + One SE')
plt.scatter([alpha_1se],
[pmse_means[alpha_index_1se]],
marker = 'o',
color = 'orange',
label = 'One SE Rule')
plt.scatter([alpha_min],
[pmse_means[alpha_index_min]],
marker = 'o',
color = 'blue',
label = 'Minimum rule')
plt.legend()
plt.xscale('log')
plt.xlabel('lambda (alpha), plotted on log scale')
plt.ylabel('MSE')
plt.show()
en_1se = ElasticNet(alpha=alpha_1se, fit_intercept=True)
en_1se.fit(X, y)
# Get coefficients and feature names
coefficients = en_1se.coef_
feature_names = X.columns
# Create a DataFrame with feature names and their coefficients
coef_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients
})
# Sort the DataFrame by the absolute value of coefficients in ascending order
coef_df = coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=True).index)
coef_df
| Feature | Coefficient | |
|---|---|---|
| 3 | Visibility | 0.031382 |
| 4 | Dew_point_temperature | -0.175916 |
| 2 | Wind_speed | -3.160495 |
| 1 | Humidity | -5.608837 |
| 24 | Hour_9 | 16.811689 |
| 5 | Solar_Radiation | 23.901443 |
| 0 | Temperature | 29.627418 |
| 31 | Hour_16 | 31.051772 |
| 38 | Hour_23 | 34.416582 |
| 6 | Rainfall | -36.211208 |
| 12 | Day_Saturday | 41.390766 |
| 14 | Day_Tuesday | 54.661669 |
| 11 | Day_Monday | 56.873123 |
| 22 | Hour_7 | 80.973063 |
| 7 | Snowfall | 87.721700 |
| 30 | Hour_15 | -88.994358 |
| 13 | Day_Thursday | 95.131128 |
| 15 | Day_Wednesday | 104.115818 |
| 8 | Holiday | -108.229479 |
| 10 | Day_Friday | 126.900409 |
| 27 | Hour_12 | -133.034942 |
| 28 | Hour_13 | -133.497925 |
| 29 | Hour_14 | -140.368917 |
| 16 | Hour_1 | -144.855558 |
| 25 | Hour_10 | -166.821163 |
| 26 | Hour_11 | -175.339200 |
| 21 | Hour_6 | -190.553157 |
| 17 | Hour_2 | -234.453959 |
| 32 | Hour_17 | 252.342759 |
| 37 | Hour_22 | 255.506707 |
| 9 | Precipitation | -264.021118 |
| 18 | Hour_3 | -279.254087 |
| 35 | Hour_20 | 335.184241 |
| 36 | Hour_21 | 338.152864 |
| 20 | Hour_5 | -353.315454 |
| 19 | Hour_4 | -359.243228 |
| 34 | Hour_19 | 372.318838 |
| 23 | Hour_8 | 388.751598 |
| 33 | Hour_18 | 636.805237 |
Visibility, Dew_point_temperature, and Wind_speed have the smallest absolute coefficients, matching the conclusion reached with the stepwise selection methods.
Choosing Subset¶
subset_columns = list(X.columns)
subset_columns.remove('Wind_speed')
subset_columns.remove('Dew_point_temperature')
subset_columns.remove('Visibility')
subset_columns
['Temperature', 'Humidity', 'Solar_Radiation', 'Rainfall', 'Snowfall', 'Holiday', 'Precipitation', 'Day_Friday', 'Day_Monday', 'Day_Saturday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday', 'Hour_1', 'Hour_2', 'Hour_3', 'Hour_4', 'Hour_5', 'Hour_6', 'Hour_7', 'Hour_8', 'Hour_9', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23']
Initial Linear Regression¶
Snowfall remains a borderline case: it was the next candidate for removal under every selection method, and removing it would avoid having to transform its heavily skewed values. It is retained here for completeness.
y = bike_df_dummy['Rented_Bike_Count']
X_subset = sm.add_constant(bike_df_dummy[subset_columns]) # column names from subset
mod = sm.OLS(y, X_subset)
res = mod.fit()
res.summary()
| Dep. Variable: | Rented_Bike_Count | R-squared: | 0.654 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.652 |
| Method: | Least Squares | F-statistic: | 260.8 |
| Date: | Thu, 07 Nov 2024 | Prob (F-statistic): | 0.00 |
| Time: | 19:39:19 | Log-Likelihood: | -36794. |
| No. Observations: | 5000 | AIC: | 7.366e+04 |
| Df Residuals: | 4963 | BIC: | 7.390e+04 |
| Df Model: | 36 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 519.1092 | 37.313 | 13.912 | 0.000 | 445.959 | 592.260 |
| Temperature | 28.4160 | 0.583 | 48.726 | 0.000 | 27.273 | 29.559 |
| Humidity | -5.2404 | 0.378 | -13.865 | 0.000 | -5.981 | -4.499 |
| Solar_Radiation | 57.1121 | 12.799 | 4.462 | 0.000 | 32.020 | 82.204 |
| Rainfall | -34.3790 | 5.192 | -6.621 | 0.000 | -44.559 | -24.200 |
| Snowfall | 99.9393 | 13.583 | 7.358 | 0.000 | 73.311 | 126.567 |
| Holiday | -122.1547 | 25.375 | -4.814 | 0.000 | -171.900 | -72.409 |
| Precipitation | -305.7318 | 22.684 | -13.478 | 0.000 | -350.202 | -261.262 |
| Day_Friday | 156.7885 | 19.786 | 7.924 | 0.000 | 118.000 | 195.577 |
| Day_Monday | 82.4004 | 19.848 | 4.152 | 0.000 | 43.490 | 121.311 |
| Day_Saturday | 65.0520 | 20.178 | 3.224 | 0.001 | 25.494 | 104.610 |
| Day_Thursday | 124.2522 | 20.075 | 6.189 | 0.000 | 84.897 | 163.608 |
| Day_Tuesday | 86.3347 | 20.353 | 4.242 | 0.000 | 46.434 | 126.235 |
| Day_Wednesday | 137.0093 | 20.092 | 6.819 | 0.000 | 97.620 | 176.398 |
| Hour_1 | -105.0012 | 37.973 | -2.765 | 0.006 | -179.445 | -30.558 |
| Hour_2 | -208.0865 | 37.681 | -5.522 | 0.000 | -281.958 | -134.215 |
| Hour_3 | -267.8847 | 38.870 | -6.892 | 0.000 | -344.088 | -191.682 |
| Hour_4 | -365.8599 | 38.818 | -9.425 | 0.000 | -441.960 | -289.760 |
| Hour_5 | -354.5664 | 38.770 | -9.145 | 0.000 | -430.574 | -278.559 |
| Hour_6 | -162.9542 | 38.510 | -4.231 | 0.000 | -238.451 | -87.458 |
| Hour_7 | 152.8001 | 39.047 | 3.913 | 0.000 | 76.250 | 229.350 |
| Hour_8 | 513.0551 | 39.099 | 13.122 | 0.000 | 436.403 | 589.707 |
| Hour_9 | 59.4768 | 39.203 | 1.517 | 0.129 | -17.379 | 136.333 |
| Hour_10 | -171.0541 | 41.724 | -4.100 | 0.000 | -252.851 | -89.257 |
| Hour_11 | -184.2710 | 42.286 | -4.358 | 0.000 | -267.171 | -101.371 |
| Hour_12 | -138.5978 | 43.344 | -3.198 | 0.001 | -223.572 | -53.624 |
| Hour_13 | -139.2865 | 43.152 | -3.228 | 0.001 | -223.883 | -54.690 |
| Hour_14 | -143.2130 | 43.014 | -3.329 | 0.001 | -227.539 | -58.887 |
| Hour_15 | -73.0030 | 41.685 | -1.751 | 0.080 | -154.724 | 8.718 |
| Hour_16 | 74.7935 | 40.577 | 1.843 | 0.065 | -4.755 | 154.342 |
| Hour_17 | 348.9732 | 39.771 | 8.775 | 0.000 | 271.004 | 426.942 |
| Hour_18 | 810.5066 | 38.351 | 21.134 | 0.000 | 735.321 | 885.692 |
| Hour_19 | 507.7543 | 38.366 | 13.234 | 0.000 | 432.539 | 582.969 |
| Hour_20 | 467.8874 | 38.365 | 12.196 | 0.000 | 392.674 | 543.101 |
| Hour_21 | 468.3512 | 38.025 | 12.317 | 0.000 | 393.806 | 542.897 |
| Hour_22 | 374.5283 | 39.092 | 9.581 | 0.000 | 297.890 | 451.166 |
| Hour_23 | 106.4905 | 38.087 | 2.796 | 0.005 | 31.823 | 181.158 |
| Omnibus: | 119.422 | Durbin-Watson: | 1.962 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 168.154 |
| Skew: | 0.269 | Prob(JB): | 3.06e-37 |
| Kurtosis: | 3.720 | Cond. No. | 1.67e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.67e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Linear Assumptions¶
Now that we've determined a subset for our model, we can check linear assumptions. We will then transform any variables that do not meet linear assumptions.
Linear¶
Are the Xs vs Y linear?
continuous_columns = bike_df.select_dtypes(include=['float64', 'int64']).columns.tolist()
continuous_columns = [col for col in continuous_columns if col in subset_columns]
# Plot partial regression plots, excluding categorical variables
fig = plt.figure(figsize=(9,6))
sm.graphics.plot_partregress_grid(res,
exog_idx=continuous_columns,
grid=(2,3),
fig=fig
)
fig.suptitle('Partial Regression Plots (Continuous Variables Only)')
fig.tight_layout()
plt.show()
Rainfall and Snowfall are both extremely right-skewed, which will strain the assumptions about the residuals. The linearity assumption itself, however, is met.
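One common option for right-skewed, zero-heavy variables like Rainfall and Snowfall is a `log1p` transform, which compresses the tail while handling the zeros that a plain log cannot. A sketch on synthetic data (this transform is a possible remedy, not a step the notebook applies):

```python
import numpy as np

rng = np.random.default_rng(3)

# Right-skewed stand-in: mostly zeros, occasional heavy values
rainfall_like = np.where(rng.uniform(size=1000) < 0.9,
                         0.0,
                         rng.exponential(8.0, size=1000))

# log1p(x) = log(1 + x): zeros stay zero, large values are compressed
transformed = np.log1p(rainfall_like)

def sample_skew(x):
    # Third standardized moment
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

before, after = sample_skew(rainfall_like), sample_skew(transformed)
```

The transform reduces but does not eliminate the skew, since the point mass at zero remains.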
Normal¶
Are the residuals normally distributed and centered at zero?
# Diagnostic 1
fig = plt.figure(figsize = (4, 4))
plt.boxplot(res.resid)
plt.ylabel("Residuals")
plt.title("Boxplot of Residuals")
plt.show()
# Diagnostic 2
fig = plt.figure(figsize = (4, 4))
plt.hist(res.resid,
density = True,
bins = 11)
plt.xlabel("Residuals")
plt.ylabel("Density")
mean = np.mean(res.resid)
sd = np.std(res.resid)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
plt.plot(x,
stats.norm.pdf(x, mean, sd),
color = 'r',
lw = 3)
plt.title("Histogram of Residuals")
plt.show()
# Diagnostic 3
fig = plt.figure(figsize = (4, 4))
stats.probplot(res.resid,
plot = plt)
plt.title("Q-Q Plot of Residuals")
plt.show()
The boxplot and histogram look promising: both depict residuals that are approximately normally distributed and centered at zero. The dataset as a whole meets the normality assumption; however, as mentioned in the Linear section, two variables (Rainfall and Snowfall) do not, and they will be investigated later.
Equal/Constant Variance¶
Do the residuals have equal/constant variance across all values of X?
fig = plt.figure(figsize = (4, 4))
sns.residplot(x = res.fittedvalues,
y = res.resid,
lowess = True,
scatter_kws = {'s': 3},
line_kws = {'color': 'red', 'lw': 1})
plt.title("Residuals vs. Fitted Values")
plt.ylabel("Residuals")
plt.xlabel("Fitted Values")
plt.show()
A distinct diagonal boundary appears in the plot because the response cannot be negative, so each residual is bounded below by the negative of its fitted value. Otherwise, the shape is generally cloud-like and homoscedastic.
However, the partial regression plots for Rainfall and Snowfall do not exhibit constant variance, so the assumption is not met.
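Beyond the visual check, a Breusch-Pagan-style test quantifies heteroscedasticity: regress the squared residuals on the predictors and compare the LM statistic n * R^2 to a chi-square critical value. This test is not part of the workflow above; the sketch builds it from scratch on synthetic data constructed to be heteroscedastic:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic regression whose error spread grows with x (heteroscedastic by design)
n = 500
x = rng.uniform(1, 10, size=n)
y = 3.0 * x + rng.normal(0, x)          # error sd proportional to x

# Fit OLS y ~ 1 + x and take residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Breusch-Pagan: regress squared residuals on X; LM = n * R^2
u2 = resid ** 2
g, *_ = np.linalg.lstsq(X, u2, rcond=None)
ss_res = np.sum((u2 - X @ g) ** 2)
ss_tot = np.sum((u2 - u2.mean()) ** 2)
lm_stat = n * (1 - ss_res / ss_tot)     # compare to chi-square, 1 df (~3.84 at 5%)
```

A large LM statistic rejects constant variance, consistent with the funnel pattern a residual plot would show for such data.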
Influential Points¶
Does the model describe all observations?
# DFFITS
bike_df_dummy['res_dffits'] = res.get_influence().dffits[0]
fig = plt.figure(figsize = (4, 4))
plt.ylabel("DFFITS (Absolute Values)")
plt.xlabel("Observation Number")
plt.scatter(bike_df_dummy.index,
np.abs(bike_df_dummy['res_dffits']),
s = 3)
threshold = 2 * np.sqrt(len(res.params) / len(bike_df_dummy))
plt.axhline(y = threshold,
color = 'r',
linestyle = 'dashed')
influential_points = bike_df_dummy[np.abs(bike_df_dummy['res_dffits']) > threshold]
for i in influential_points.index:
plt.annotate(str(i),
(i, np.abs(bike_df_dummy['res_dffits'][i])),
textcoords="offset points",
xytext=(0, 5),
ha='center',
fontsize=8)
plt.show()
outlier_indices = [3997, 4012, 6498, 8602, 564]
bike_df_dummy.loc[outlier_indices][['Rented_Bike_Count']+subset_columns]
| Rented_Bike_Count | Temperature | Humidity | Solar_Radiation | Rainfall | Snowfall | Holiday | Precipitation | Day_Friday | Day_Monday | ... | Hour_14 | Hour_15 | Hour_16 | Hour_17 | Hour_18 | Hour_19 | Hour_20 | Hour_21 | Hour_22 | Hour_23 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3997 | 151 | 21.8 | 97 | 0.06 | 35.0 | 0.0 | 0.0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4012 | 51 | 19.2 | 98 | 0.00 | 19.0 | 0.0 | 0.0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6498 | 96 | 24.3 | 97 | 0.01 | 18.5 | 0.0 | 0.0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8602 | 62 | 0.4 | 97 | 0.00 | 0.0 | 8.8 | 0.0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 564 | 4 | 4.1 | 91 | 0.07 | 9.5 | 0.0 | 1.0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 37 columns
The biggest outlier found with DFBETAS and DFFITS is observation 3997. There was an extreme amount of rain that hour, but 35 mm in an hour is a plausible observation, so no action will be taken to remove it, and the assumption that the model describes all observations is met.
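The dashed cutoff in the DFFITS plot comes from the common rule of thumb 2 * sqrt(p / n), where p counts model parameters (including the intercept) and n the observations. Evaluated at this model's dimensions:

```python
import numpy as np

# 36 predictors in the subset plus the intercept, over 5,000 observations
p, n = 37, 5000

# DFFITS flagging threshold: 2 * sqrt(p / n)
threshold = 2 * np.sqrt(p / n)
```

Observations whose |DFFITS| exceed roughly 0.17 are the ones annotated in the plot above.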
Multicollinearity¶
Is there any multicollinearity among the variables?
# Variance Inflation Factors (VIF)
env_vifs = pd.DataFrame()
env_vifs['Feature'] = X_subset.columns[1:]
env_vifs['VIF'] = [vif(X_subset, i) for i in range(1, len(X_subset.columns))]
print("Max = ", max(env_vifs['VIF']))
print("Mean = ", np.mean(env_vifs['VIF']))
env_vifs.sort_values(by = 'VIF', ascending=False)
Max = 4.171058615760434 Mean = 2.044229346121867
| Feature | VIF | |
|---|---|---|
| 2 | Solar_Radiation | 4.171059 |
| 25 | Hour_13 | 2.752592 |
| 24 | Hour_12 | 2.600201 |
| 26 | Hour_14 | 2.514029 |
| 23 | Hour_11 | 2.497341 |
| 27 | Hour_15 | 2.426806 |
| 28 | Hour_16 | 2.309849 |
| 21 | Hour_9 | 2.223743 |
| 22 | Hour_10 | 2.200112 |
| 30 | Hour_18 | 2.128150 |
| 29 | Hour_17 | 2.109292 |
| 14 | Hour_2 | 2.107756 |
| 33 | Hour_21 | 2.073898 |
| 13 | Hour_1 | 2.068226 |
| 31 | Hour_19 | 2.065050 |
| 32 | Hour_20 | 2.055688 |
| 35 | Hour_23 | 2.053373 |
| 1 | Humidity | 2.042612 |
| 18 | Hour_6 | 2.033831 |
| 20 | Hour_8 | 2.028950 |
| 17 | Hour_5 | 1.994974 |
| 16 | Hour_4 | 1.990329 |
| 15 | Hour_3 | 1.986140 |
| 19 | Hour_7 | 1.984942 |
| 34 | Hour_22 | 1.950746 |
| 6 | Precipitation | 1.760281 |
| 7 | Day_Friday | 1.714972 |
| 0 | Temperature | 1.714516 |
| 8 | Day_Monday | 1.699134 |
| 10 | Day_Thursday | 1.678970 |
| 9 | Day_Saturday | 1.678168 |
| 12 | Day_Wednesday | 1.675836 |
| 11 | Day_Tuesday | 1.649322 |
| 4 | Snowfall | 1.377807 |
| 3 | Rainfall | 1.261509 |
| 5 | Holiday | 1.012054 |
correlation_matrix = X_subset.corr()
rounded_correlation_matrix = correlation_matrix.round(2)
# Create a custom mask where correlations less than ±0.5 are not shown
mask = np.abs(rounded_correlation_matrix) < 0.5
# Create the heatmap with only significant correlations displayed
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix,
cmap='coolwarm', # color palette
annot=rounded_correlation_matrix.where(~mask).fillna(''), # only annotate correlations of magnitude >= 0.5
fmt="", # annotations are pre-rounded strings, so no numeric format
vmin=-1, # start color legend at -1
vmax=1, # end color legend at 1
mask=np.triu(correlation_matrix)) # hide the redundant upper triangle
plt.title("Correlation Heatmap of Selected Variables")
plt.show()
The VIF and correlation values both look promising, with the highest VIF belonging to Solar Radiation at 4.17. While a value above 4 is not ideal, the variable is never flagged for removal during variable selection, so we will continue despite the relatively high value. In the correlation matrix, not a single pairwise correlation exceeds 0.5. Therefore, the assumption of no multicollinearity is met.
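As a sanity check on what VIF measures, the values above can be reproduced two equivalent ways: VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors, or as the diagonal of the inverse correlation matrix. Below is a minimal numpy-only sketch on synthetic data (the variables and coefficients are made up for illustration, not taken from the bike dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# two correlated predictors plus one independent predictor (synthetic)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Method 1: VIF_j = 1 / (1 - R^2_j), regressing column j on the others
def vif_manual(X, j):
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = np.array([vif_manual(X, j) for j in range(X.shape[1])])

# Method 2: diagonal of the inverse correlation matrix
vifs_inv = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
```

The independent predictor's VIF sits near 1, while the correlated pair inflates well above it, matching the pattern seen in the table above.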
Linear Assumptions Conclusion¶
Are the X's and Y linear? Met
Are the residuals independent? Met
While there were data columns identifying the date and hour, both were used as categorical variables, with Date converted to Day and Hour treated as a category. Otherwise, the observations were collected in a way that supports independence of the residuals.
Are the residuals normally distributed and centered at zero? Met
Do the residuals have equal/constant variance across all values of X? Not met
Are there any influential points? Does the model describe all observations? Met
Are any additional predictor variables required? Met
We have many predictor variables, and it seems unlikely that other variables are needed for the model. Most plausible weather variables are included, along with date variables, to act as predictors.
Is there any multicollinearity present? Met
Transformations¶
In order to meet the linear assumptions necessary for multiple linear regression, the variables Snowfall and Rainfall both need to be transformed.
Finding the Best Transformations¶
subset_columns.remove('Snowfall')
subset_columns.remove('Rainfall')
subset_columns
['Temperature', 'Humidity', 'Solar_Radiation', 'Holiday', 'Precipitation', 'Day_Friday', 'Day_Monday', 'Day_Saturday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday', 'Hour_1', 'Hour_2', 'Hour_3', 'Hour_4', 'Hour_5', 'Hour_6', 'Hour_7', 'Hour_8', 'Hour_9', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23']
# apply the transformations sqrt(x), log(x+1), 1/(sqrt(x)+1), and 1/(x+1), plotting each transformed variable against the dependent variable ('Rented Bike Count')
def transform_and_plot(df, variable):
# Transformations
transformations = {
f'sqrt(\'{variable}\')': np.sqrt(df[variable]),
f'log(\'{variable}\')': np.log(df[variable]+1),
f'1/sqrt(\'{variable}\')': 1 / (np.sqrt(df[variable])+1),
f'1/\'{variable}\'': 1 / (df[variable]+1)
}
# Create subplots
fig, axes = plt.subplots(1, 4, figsize=(15, 4))
axes = axes.ravel() # Flatten the axes array for easier iteration
# Iterate through transformations and create plots
for i, (name, transformed_y) in enumerate(transformations.items()):
axes[i].scatter(df['Rented_Bike_Count'], transformed_y)
axes[i].set_xlabel('Rented Bike Count')
axes[i].set_ylabel(name)
axes[i].set_title(f'\'Rented Bike Count\' vs. {name}')
# Remove any unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.suptitle(f'Scatter Plots of \'Rented Bike Count\' vs. Transformed \'{variable}\'', fontsize=16)
plt.tight_layout()
plt.show()
transform_and_plot(bike_df_dummy, 'Snowfall')
transform_and_plot(bike_df_dummy, 'Rainfall')
Although none of the transformations completely removes the extreme right skew, log(x+1) looks like the best option for both variables.
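One way to back up the visual judgment above is to quantify the skew directly with scipy. The sketch below uses a simulated zero-inflated rainfall-like column (the data is made up, not the actual Rainfall variable); note that log1p(x) = log(x + 1) conveniently leaves the many zero observations at zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# hypothetical stand-in for an hourly rainfall column: mostly zeros, a few large values
rainfall = np.where(rng.random(5000) < 0.05,
                    rng.exponential(scale=8.0, size=5000),
                    0.0)

raw_skew = stats.skew(rainfall)
log_skew = stats.skew(np.log1p(rainfall))  # same log(x + 1) transform as above
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

The skew stays positive after the transform (the zeros dominate either way), but it drops substantially, which is the same pattern the scatter plots show.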
# now add the transformed columns to the dataframe and the names to the subset of columns variable
bike_df_dummy['Snowfall_log'] = np.log(bike_df_dummy['Snowfall']+1)
bike_df_dummy['Rainfall_log'] = np.log(bike_df_dummy['Rainfall']+1)
subset_columns.append("Snowfall_log")
subset_columns.append("Rainfall_log")
subset_columns
['Temperature', 'Humidity', 'Solar_Radiation', 'Holiday', 'Precipitation', 'Day_Friday', 'Day_Monday', 'Day_Saturday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday', 'Hour_1', 'Hour_2', 'Hour_3', 'Hour_4', 'Hour_5', 'Hour_6', 'Hour_7', 'Hour_8', 'Hour_9', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23', 'Snowfall_log', 'Rainfall_log']
Linear Regression After Transformation¶
X_subset_trans = sm.add_constant(bike_df_dummy[subset_columns])
mod = sm.OLS(y, X_subset_trans)
res_trans = mod.fit()
res_trans.summary()
| Dep. Variable: | Rented_Bike_Count | R-squared: | 0.659 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.656 |
| Method: | Least Squares | F-statistic: | 266.2 |
| Date: | Thu, 07 Nov 2024 | Prob (F-statistic): | 0.00 |
| Time: | 19:40:34 | Log-Likelihood: | -36760. |
| No. Observations: | 5000 | AIC: | 7.359e+04 |
| Df Residuals: | 4963 | BIC: | 7.384e+04 |
| Df Model: | 36 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 489.0899 | 37.213 | 13.143 | 0.000 | 416.135 | 562.045 |
| Temperature | 28.9304 | 0.583 | 49.587 | 0.000 | 27.787 | 30.074 |
| Humidity | -4.8701 | 0.378 | -12.886 | 0.000 | -5.611 | -4.129 |
| Solar_Radiation | 51.6667 | 12.726 | 4.060 | 0.000 | 26.718 | 76.615 |
| Holiday | -121.9669 | 25.206 | -4.839 | 0.000 | -171.382 | -72.551 |
| Precipitation | -260.8422 | 29.112 | -8.960 | 0.000 | -317.915 | -203.770 |
| Day_Friday | 158.4079 | 19.658 | 8.058 | 0.000 | 119.870 | 196.945 |
| Day_Monday | 84.7764 | 19.718 | 4.299 | 0.000 | 46.121 | 123.432 |
| Day_Saturday | 67.9168 | 20.039 | 3.389 | 0.001 | 28.632 | 107.201 |
| Day_Thursday | 119.2899 | 19.947 | 5.980 | 0.000 | 80.186 | 158.394 |
| Day_Tuesday | 84.8379 | 20.218 | 4.196 | 0.000 | 45.201 | 124.474 |
| Day_Wednesday | 131.3576 | 19.965 | 6.580 | 0.000 | 92.218 | 170.497 |
| Hour_1 | -107.2017 | 37.726 | -2.842 | 0.005 | -181.162 | -33.241 |
| Hour_2 | -208.4258 | 37.432 | -5.568 | 0.000 | -281.809 | -135.043 |
| Hour_3 | -263.3751 | 38.621 | -6.820 | 0.000 | -339.089 | -187.661 |
| Hour_4 | -366.2829 | 38.560 | -9.499 | 0.000 | -441.877 | -290.689 |
| Hour_5 | -356.4273 | 38.518 | -9.254 | 0.000 | -431.939 | -280.915 |
| Hour_6 | -159.5245 | 38.258 | -4.170 | 0.000 | -234.528 | -84.521 |
| Hour_7 | 152.7409 | 38.789 | 3.938 | 0.000 | 76.698 | 228.784 |
| Hour_8 | 514.6343 | 38.837 | 13.251 | 0.000 | 438.497 | 590.772 |
| Hour_9 | 72.0602 | 38.965 | 1.849 | 0.064 | -4.328 | 148.448 |
| Hour_10 | -159.2182 | 41.453 | -3.841 | 0.000 | -240.484 | -77.952 |
| Hour_11 | -169.3317 | 42.016 | -4.030 | 0.000 | -251.701 | -86.962 |
| Hour_12 | -118.7849 | 43.105 | -2.756 | 0.006 | -203.290 | -34.280 |
| Hour_13 | -125.9206 | 42.855 | -2.938 | 0.003 | -209.936 | -41.906 |
| Hour_14 | -128.7665 | 42.746 | -3.012 | 0.003 | -212.568 | -44.965 |
| Hour_15 | -60.3220 | 41.423 | -1.456 | 0.145 | -141.529 | 20.885 |
| Hour_16 | 86.3201 | 40.321 | 2.141 | 0.032 | 7.273 | 165.367 |
| Hour_17 | 359.5858 | 39.523 | 9.098 | 0.000 | 282.103 | 437.068 |
| Hour_18 | 818.3673 | 38.107 | 21.476 | 0.000 | 743.661 | 893.073 |
| Hour_19 | 513.6195 | 38.115 | 13.475 | 0.000 | 438.897 | 588.342 |
| Hour_20 | 468.0701 | 38.101 | 12.285 | 0.000 | 393.375 | 542.765 |
| Hour_21 | 472.6948 | 37.776 | 12.513 | 0.000 | 398.638 | 546.752 |
| Hour_22 | 378.4853 | 38.835 | 9.746 | 0.000 | 302.352 | 454.618 |
| Hour_23 | 110.2945 | 37.837 | 2.915 | 0.004 | 36.117 | 184.472 |
| Snowfall_log | 229.2104 | 35.850 | 6.394 | 0.000 | 158.928 | 299.493 |
| Rainfall_log | -215.7351 | 26.475 | -8.149 | 0.000 | -267.638 | -163.832 |
| Omnibus: | 121.184 | Durbin-Watson: | 1.963 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 174.894 |
| Skew: | 0.264 | Prob(JB): | 1.05e-38 |
| Kurtosis: | 3.748 | Cond. No. | 1.67e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.67e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Check Assumptions After Transformations¶
Linear¶
Are the Xs vs Y linear?
continuous_columns.remove('Snowfall')
continuous_columns.remove('Rainfall')
continuous_columns.append('Snowfall_log')
continuous_columns.append('Rainfall_log')
# Plot partial regression plots, excluding categorical variables
fig = plt.figure(figsize=(9,6))
sm.graphics.plot_partregress_grid(res_trans,
exog_idx=continuous_columns,
grid=(2,3),
fig=fig
)
fig.suptitle('Partial Regression Plots (Continuous Variables Only)')
fig.tight_layout()
plt.show()
The X's and Y still appear linear post transformation. The assumption is met.
The variance of Snowfall_log and Rainfall_log even looks better after the transformation.
Normal¶
Are the residuals normally distributed and centered at zero?
# Diagnostic 1
fig = plt.figure(figsize = (4, 4))
plt.boxplot(res_trans.resid)
plt.ylabel("Residuals")
plt.title("Boxplot of Residuals")
plt.show()
# Diagnostic 2
fig = plt.figure(figsize = (4, 4))
plt.hist(res_trans.resid,
density = True,
bins = 11)
plt.xlabel("Residuals")
plt.ylabel("Density")
mean = np.mean(res_trans.resid)
sd = np.std(res_trans.resid)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
plt.plot(x,
stats.norm.pdf(x, mean, sd),
color = 'r',
lw = 3)
plt.title("Histogram of Residuals")
plt.show()
# Diagnostic 3
fig = plt.figure(figsize = (4, 4))
stats.probplot(res_trans.resid,
plot = plt)
plt.title("Q-Q Plot of Residuals")
plt.show()
Normality of the residuals is unchanged after the transformation. The assumption is met.
Equal/Constant Variance¶
Do the residuals have equal/constant variance across all values of X?
fig = plt.figure(figsize = (4, 4))
sns.residplot(x = res_trans.fittedvalues,
y = res_trans.resid,
lowess = True,
scatter_kws = {'s': 3},
line_kws = {'color': 'red', 'lw': 1})
plt.title("Residuals vs. Fitted")
plt.ylabel("Residuals")
plt.xlabel("Fitted Values")
plt.show()
Although the plot still looks somewhat unusual, it is actually improved post-transformation, and the assumption is still met.
Influential Points¶
Does the model describe all observations?
# DFFITS
bike_df_dummy['res_trans_dffits'] = res_trans.get_influence().dffits[0]
fig = plt.figure(figsize = (4, 4))
plt.ylabel("DFFITS (Absolute Values)")
plt.xlabel("Observation Number")
plt.scatter(bike_df_dummy.index,
np.abs(bike_df_dummy['res_trans_dffits']),
s = 3)
threshold = 2 * np.sqrt(len(res_trans.params) / len(bike_df_dummy))
plt.axhline(y = threshold,
color = 'r',
linestyle = 'dashed')
influential_points = bike_df_dummy[np.abs(bike_df_dummy['res_trans_dffits']) > threshold]
for i in influential_points.index:
plt.annotate(str(i),
(i, np.abs(bike_df_dummy['res_trans_dffits'][i])),
textcoords="offset points",
xytext=(0, 5),
ha='center',
fontsize=8)
plt.show()
Perhaps unsurprisingly, the spread of the DFFITS values is greater now that the transformations have been performed. Most of the high values correspond to the same observations as before and don't necessitate further investigation. Although we would like fewer outliers affecting the model, this is acceptable for the current model.
Multicollinearity¶
Is there any multicollinearity among the variables?
# Variance Inflation Factors (VIF)
env_vifs = pd.DataFrame()
env_vifs['Feature'] = X_subset_trans.columns[1:]
env_vifs['VIF'] = [vif(X_subset_trans, i) for i in range(1, len(X_subset_trans.columns))]
print("Max = ", max(env_vifs['VIF']))
print("Mean = ", np.mean(env_vifs['VIF']))
env_vifs.sort_values(by = 'VIF', ascending=False)
Max = 4.171058615760434 Mean = 2.044229346121867
| Feature | VIF | |
|---|---|---|
| 2 | Solar_Radiation | 4.171059 |
| 25 | Hour_15 | 2.752592 |
| 24 | Hour_14 | 2.600201 |
| 26 | Hour_16 | 2.514029 |
| 23 | Hour_13 | 2.497341 |
| 27 | Hour_17 | 2.426806 |
| 28 | Hour_18 | 2.309849 |
| 21 | Hour_11 | 2.223743 |
| 22 | Hour_12 | 2.200112 |
| 30 | Hour_20 | 2.128150 |
| 29 | Hour_19 | 2.109292 |
| 14 | Hour_4 | 2.107756 |
| 33 | Hour_23 | 2.073898 |
| 13 | Hour_3 | 2.068226 |
| 31 | Hour_21 | 2.065050 |
| 32 | Hour_22 | 2.055688 |
| 35 | Rainfall_log | 2.053373 |
| 1 | Humidity | 2.042612 |
| 18 | Hour_8 | 2.033831 |
| 20 | Hour_10 | 2.028950 |
| 17 | Hour_7 | 1.994974 |
| 16 | Hour_6 | 1.990329 |
| 15 | Hour_5 | 1.986140 |
| 19 | Hour_9 | 1.984942 |
| 34 | Snowfall_log | 1.950746 |
| 6 | Day_Monday | 1.760281 |
| 7 | Day_Saturday | 1.714972 |
| 0 | Temperature | 1.714516 |
| 8 | Day_Thursday | 1.699134 |
| 10 | Day_Wednesday | 1.678970 |
| 9 | Day_Tuesday | 1.678168 |
| 12 | Hour_2 | 1.675836 |
| 11 | Hour_1 | 1.649322 |
| 4 | Precipitation | 1.377807 |
| 3 | Holiday | 1.261509 |
| 5 | Day_Friday | 1.012054 |
The high multicollinearity values are nearly identical, since Solar Radiation and the Hour dummies were not affected by the transformation.
Transformed Assumptions Conclusion¶
Now that the variables Snowfall and Rainfall have been transformed, the assumptions all look much better. Although both variables remain very right skewed, all assumptions are now considered met, and none was made worse by the transformation.
Interactions¶
To determine which variables should be included as interaction terms, the relationships that can be graphed will be plotted, and the remaining combinations of variables will be tested.
Plotting the Interactions¶
def make_interaction_plot(x, trace):
fig, ax = plt.subplots(figsize=(6, 4))
interaction_plot(
x=bike_df_dummy[x].to_numpy(),
trace=bike_df_dummy[trace].to_numpy(),
response=bike_df_dummy["Rented_Bike_Count"].to_numpy(),
colors=["#1b9e77", "#d95f02"],
markers=["o", "^"],
ms=8,
ax=ax
)
ax.set_xlabel(x)
ax.set_ylabel("Average Bike Demand")
plt.title(" ".join(["Interaction:", trace, "×", x]))
plt.show()
interaction_column = "_".join([x,trace,"Interaction"])
bike_df_dummy[interaction_column] = bike_df_dummy[x] * bike_df_dummy[trace]
# Combine subset_columns and the interaction term into a single DataFrame
X = pd.concat([bike_df_dummy[subset_columns], bike_df_dummy[interaction_column]], axis=1)
# Fit the OLS model
X = sm.add_constant(X)
res_inter = sm.OLS(y, X).fit()
print(interaction_column,":",sep="")
print(anova_lm(res_trans, res_inter))
make_interaction_plot(x="Temperature", trace="Precipitation",)
make_interaction_plot(x="Temperature", trace="Holiday",)
make_interaction_plot(x="Humidity", trace="Precipitation",)
make_interaction_plot(x="Humidity", trace="Holiday",)
make_interaction_plot(x="Visibility", trace="Precipitation",)
make_interaction_plot(x="Visibility", trace="Holiday",)
make_interaction_plot(x="Solar_Radiation", trace="Precipitation",)
make_interaction_plot(x="Solar_Radiation", trace="Holiday",)
make_interaction_plot(x="Holiday", trace="Precipitation",)
Temperature_Precipitation_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 6.995468e+08 1.0 1.231179e+07 87.329527 1.354771e-20
Temperature_Holiday_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.111053e+08 1.0 753291.800479 5.256372 0.021908
Humidity_Precipitation_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.088517e+08 1.0 3.006886e+06 21.048362 0.000005
Humidity_Holiday_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.114416e+08 1.0 417049.830854 2.908744 0.088163
Visibility_Precipitation_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.109996e+08 1.0 858984.436697 5.994772 0.014383
Visibility_Holiday_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.112644e+08 1.0 594233.782483 4.145558 0.041797
Solar_Radiation_Precipitation_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.102290e+08 1.0 1.629592e+06 11.385108 0.000746
Solar_Radiation_Holiday_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.104978e+08 1.0 1.360842e+06 9.503896 0.002062
Holiday_Precipitation_Interaction: df_resid ssr df_diff ss_diff F Pr(>F) 0 4963.0 7.118586e+08 0.0 NaN NaN NaN 1 4962.0 7.117689e+08 1.0 89661.310396 0.625062 0.42921
Testing All Interactions¶
A combination of all variables will be tested, but to maintain interpretability, combinations involving the dummy-coded categorical variables (Hour, Day) will not be considered.
# functions to check all combinations of interactions against the model
def check_interactions(data, model, response_var):
# Extract the predictor names from the fitted model's exogenous variables
predictors = [col for col in model.model.exog_names if col != 'Intercept' and col != response_var and col != 'const']
# Prepare an empty list to store interaction results
interaction_results = []
# Loop through all combinations of variables for pairwise interactions
for var1, var2 in combinations(predictors, 2):
# Define the interaction formula
interaction_formula = f"{response_var} ~ {' + '.join(predictors)} + {var1}:{var2}"
try:
# Fit the model with the interaction term
res_inter = sm.OLS.from_formula(interaction_formula, data=data).fit()
# Perform ANOVA to compare the model with and without the interaction term
anova_results = anova_lm(model, res_inter)
# Extract the F-statistic and p-value for the interaction
f_stat = anova_results["F"].iloc[-1]
p_value = anova_results["Pr(>F)"].iloc[-1]
# Append the result to the list
interaction_results.append({
"Interaction": f"{var1}:{var2}",
"F-Statistic": f_stat,
"p-Value": p_value
})
except Exception as e:
print(f"Error with interaction {var1}:{var2} - {e}")
# Convert the results into a DataFrame for easy viewing
interaction_df = pd.DataFrame(interaction_results)
# Sort by p-value to quickly identify significant interactions
interaction_df = interaction_df.sort_values(by="p-Value").reset_index(drop=True)
return interaction_df
interaction_combos = check_interactions(bike_df_dummy, res_trans, "Rented_Bike_Count")
# only the significant interactions are shown, p-value < 0.05
# to avoid complexity, we will ignore interactions that contain 'Hour' or 'Day'
significant_interactions = interaction_combos[(~interaction_combos['Interaction'].str.contains('Hour|Day')) &
(interaction_combos['p-Value'] < 0.05)].sort_values(by='p-Value', ascending=True)
significant_interactions
| Interaction | F-Statistic | p-Value | |
|---|---|---|---|
| 1 | Temperature:Solar_Radiation | 179.743412 | 2.750770e-40 |
| 8 | Temperature:Precipitation | 87.329527 | 1.354771e-20 |
| 9 | Temperature:Humidity | 75.470694 | 4.964343e-18 |
| 26 | Humidity:Precipitation | 21.048362 | 4.588689e-06 |
| 35 | Temperature:Rainfall_log | 17.491176 | 2.936234e-05 |
| 47 | Temperature:Snowfall_log | 14.486984 | 1.428322e-04 |
| 55 | Solar_Radiation:Precipitation | 11.385108 | 7.460288e-04 |
| 57 | Solar_Radiation:Rainfall_log | 10.986434 | 9.243979e-04 |
| 65 | Solar_Radiation:Holiday | 9.503896 | 2.061618e-03 |
| 86 | Holiday:Snowfall_log | 6.582978 | 1.032496e-02 |
| 102 | Temperature:Holiday | 5.256372 | 2.190821e-02 |
Picking Interactions to Include¶
Based on the p-value from the ANOVA test being less than 0.05 (and visually inspecting the graph), all of the above interactions will be included.
for interaction in significant_interactions['Interaction']:
var1, var2 = interaction.split(':')
bike_df_dummy[interaction] = bike_df_dummy[var1] * bike_df_dummy[var2]
subset_columns.append(interaction)
subset_columns
['Temperature', 'Humidity', 'Solar_Radiation', 'Holiday', 'Precipitation', 'Day_Friday', 'Day_Monday', 'Day_Saturday', 'Day_Thursday', 'Day_Tuesday', 'Day_Wednesday', 'Hour_1', 'Hour_2', 'Hour_3', 'Hour_4', 'Hour_5', 'Hour_6', 'Hour_7', 'Hour_8', 'Hour_9', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23', 'Snowfall_log', 'Rainfall_log', 'Temperature:Solar_Radiation', 'Temperature:Precipitation', 'Temperature:Humidity', 'Humidity:Precipitation', 'Temperature:Rainfall_log', 'Temperature:Snowfall_log', 'Solar_Radiation:Precipitation', 'Solar_Radiation:Rainfall_log', 'Solar_Radiation:Holiday', 'Holiday:Snowfall_log', 'Temperature:Holiday']
Final Linear Regression Model¶
X_subset_trans_inters = sm.add_constant(bike_df_dummy[subset_columns]) # with transformations and interactions, without ommited variables
mod = sm.OLS(y, X_subset_trans_inters)
res_final = mod.fit()
res_final.summary()
| Dep. Variable: | Rented_Bike_Count | R-squared: | 0.695 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.692 |
| Method: | Least Squares | F-statistic: | 239.6 |
| Date: | Thu, 07 Nov 2024 | Prob (F-statistic): | 0.00 |
| Time: | 19:42:35 | Log-Likelihood: | -36483. |
| No. Observations: | 5000 | AIC: | 7.306e+04 |
| Df Residuals: | 4952 | BIC: | 7.338e+04 |
| Df Model: | 47 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 148.4514 | 40.236 | 3.690 | 0.000 | 69.572 | 227.331 |
| Temperature | 59.8691 | 1.895 | 31.591 | 0.000 | 56.154 | 63.584 |
| Humidity | 0.4794 | 0.506 | 0.948 | 0.343 | -0.512 | 1.471 |
| Solar_Radiation | 292.8368 | 19.801 | 14.789 | 0.000 | 254.019 | 331.655 |
| Holiday | -198.3036 | 34.932 | -5.677 | 0.000 | -266.786 | -129.822 |
| Precipitation | 41.2258 | 120.682 | 0.342 | 0.733 | -195.364 | 277.815 |
| Day_Friday | 164.2460 | 18.682 | 8.792 | 0.000 | 127.622 | 200.870 |
| Day_Monday | 84.5795 | 18.765 | 4.507 | 0.000 | 47.792 | 121.367 |
| Day_Saturday | 48.0691 | 19.064 | 2.521 | 0.012 | 10.695 | 85.443 |
| Day_Thursday | 123.4526 | 19.070 | 6.474 | 0.000 | 86.066 | 160.839 |
| Day_Tuesday | 96.3353 | 19.359 | 4.976 | 0.000 | 58.383 | 134.288 |
| Day_Wednesday | 142.5958 | 19.254 | 7.406 | 0.000 | 104.849 | 180.343 |
| Hour_1 | -91.0219 | 35.762 | -2.545 | 0.011 | -161.131 | -20.913 |
| Hour_2 | -197.3343 | 35.469 | -5.563 | 0.000 | -266.870 | -127.798 |
| Hour_3 | -240.4850 | 36.605 | -6.570 | 0.000 | -312.248 | -168.722 |
| Hour_4 | -353.5815 | 36.551 | -9.674 | 0.000 | -425.238 | -281.925 |
| Hour_5 | -344.3330 | 36.512 | -9.431 | 0.000 | -415.912 | -272.754 |
| Hour_6 | -150.0368 | 36.258 | -4.138 | 0.000 | -221.118 | -78.956 |
| Hour_7 | 159.6955 | 36.774 | 4.343 | 0.000 | 87.602 | 231.789 |
| Hour_8 | 504.3373 | 36.856 | 13.684 | 0.000 | 432.083 | 576.592 |
| Hour_9 | 31.7588 | 37.257 | 0.852 | 0.394 | -41.281 | 104.798 |
| Hour_10 | -219.3467 | 40.140 | -5.465 | 0.000 | -298.038 | -140.655 |
| Hour_11 | -234.4488 | 41.118 | -5.702 | 0.000 | -315.059 | -153.839 |
| Hour_12 | -203.9988 | 42.645 | -4.784 | 0.000 | -287.602 | -120.395 |
| Hour_13 | -210.9076 | 42.112 | -5.008 | 0.000 | -293.466 | -128.349 |
| Hour_14 | -202.9616 | 41.737 | -4.863 | 0.000 | -284.785 | -121.138 |
| Hour_15 | -117.8568 | 40.029 | -2.944 | 0.003 | -196.332 | -39.382 |
| Hour_16 | 27.8581 | 38.664 | 0.721 | 0.471 | -47.939 | 103.656 |
| Hour_17 | 326.1449 | 37.627 | 8.668 | 0.000 | 252.380 | 399.910 |
| Hour_18 | 793.3650 | 36.172 | 21.933 | 0.000 | 722.451 | 864.279 |
| Hour_19 | 509.8775 | 36.172 | 14.096 | 0.000 | 438.964 | 580.791 |
| Hour_20 | 461.8000 | 36.124 | 12.784 | 0.000 | 390.981 | 532.619 |
| Hour_21 | 459.9039 | 35.804 | 12.845 | 0.000 | 389.712 | 530.096 |
| Hour_22 | 387.2620 | 36.804 | 10.522 | 0.000 | 315.110 | 459.414 |
| Hour_23 | 112.7908 | 35.861 | 3.145 | 0.002 | 42.488 | 183.093 |
| Snowfall_log | -5.6980 | 41.053 | -0.139 | 0.890 | -86.180 | 74.784 |
| Rainfall_log | -163.8090 | 57.310 | -2.858 | 0.004 | -276.162 | -51.456 |
| Temperature:Solar_Radiation | -12.3810 | 0.655 | -18.908 | 0.000 | -13.665 | -11.097 |
| Temperature:Precipitation | -15.5492 | 2.712 | -5.734 | 0.000 | -20.865 | -10.233 |
| Temperature:Humidity | -0.4439 | 0.030 | -14.590 | 0.000 | -0.504 | -0.384 |
| Humidity:Precipitation | -2.2102 | 1.543 | -1.433 | 0.152 | -5.235 | 0.814 |
| Temperature:Rainfall_log | 2.2960 | 3.067 | 0.749 | 0.454 | -3.716 | 8.308 |
| Temperature:Snowfall_log | -7.9942 | 8.528 | -0.937 | 0.349 | -24.714 | 8.725 |
| Solar_Radiation:Precipitation | 10.7975 | 70.618 | 0.153 | 0.878 | -127.644 | 149.240 |
| Solar_Radiation:Rainfall_log | 705.0208 | 147.252 | 4.788 | 0.000 | 416.342 | 993.699 |
| Solar_Radiation:Holiday | 60.8332 | 30.335 | 2.005 | 0.045 | 1.363 | 120.303 |
| Holiday:Snowfall_log | 358.1022 | 163.704 | 2.188 | 0.029 | 37.170 | 679.034 |
| Temperature:Holiday | 2.3316 | 2.322 | 1.004 | 0.315 | -2.220 | 6.883 |
| Omnibus: | 144.464 | Durbin-Watson: | 1.985 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 208.909 |
| Skew: | 0.303 | Prob(JB): | 4.33e-46 |
| Kurtosis: | 3.798 | Cond. No. | 3.56e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.56e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
res_final.params.sort_values()
Hour_4 -353.581506 Hour_5 -344.332951 Hour_3 -240.484982 Hour_11 -234.448835 Hour_10 -219.346651 Hour_13 -210.907640 Hour_12 -203.998836 Hour_14 -202.961575 Holiday -198.303647 Hour_2 -197.334281 Rainfall_log -163.809042 Hour_6 -150.036825 Hour_15 -117.856798 Hour_1 -91.021920 Temperature:Precipitation -15.549240 Temperature:Solar_Radiation -12.380966 Temperature:Snowfall_log -7.994164 Snowfall_log -5.697990 Humidity:Precipitation -2.210181 Temperature:Humidity -0.443873 Humidity 0.479386 Temperature:Rainfall_log 2.295985 Temperature:Holiday 2.331648 Solar_Radiation:Precipitation 10.797530 Hour_16 27.858115 Hour_9 31.758775 Precipitation 41.225782 Day_Saturday 48.069146 Temperature 59.869086 Solar_Radiation:Holiday 60.833170 Day_Monday 84.579474 Day_Tuesday 96.335278 Hour_23 112.790763 Day_Thursday 123.452644 Day_Wednesday 142.595794 const 148.451382 Hour_7 159.695499 Day_Friday 164.246034 Solar_Radiation 292.836839 Hour_17 326.144902 Holiday:Snowfall_log 358.102178 Hour_22 387.261961 Hour_21 459.903937 Hour_20 461.799975 Hour_8 504.337255 Hour_19 509.877546 Solar_Radiation:Rainfall_log 705.020756 Hour_18 793.364999 dtype: float64
# Generate a formatted string for the regression equation
equation = " + ".join([f"{coef:.2f} * {name}" if name != "const" else f"{coef:.2f}"
for name, coef in res_final.params.sort_values(ascending=False).items()])
# Print the formatted equation
print(f"Rented Bike Count = {equation}")
Rented Bike Count = 793.36 * Hour_18 + 705.02 * Solar_Radiation:Rainfall_log + 509.88 * Hour_19 + 504.34 * Hour_8 + 461.80 * Hour_20 + 459.90 * Hour_21 + 387.26 * Hour_22 + 358.10 * Holiday:Snowfall_log + 326.14 * Hour_17 + 292.84 * Solar_Radiation + 164.25 * Day_Friday + 159.70 * Hour_7 + 148.45 * const + 142.60 * Day_Wednesday + 123.45 * Day_Thursday + 112.79 * Hour_23 + 96.34 * Day_Tuesday + 84.58 * Day_Monday + 60.83 * Solar_Radiation:Holiday + 59.87 * Temperature + 48.07 * Day_Saturday + 41.23 * Precipitation + 31.76 * Hour_9 + 27.86 * Hour_16 + 10.80 * Solar_Radiation:Precipitation + 2.33 * Temperature:Holiday + 2.30 * Temperature:Rainfall_log + 0.48 * Humidity + -0.44 * Temperature:Humidity + -2.21 * Humidity:Precipitation + -5.70 * Snowfall_log + -7.99 * Temperature:Snowfall_log + -12.38 * Temperature:Solar_Radiation + -15.55 * Temperature:Precipitation + -91.02 * Hour_1 + -117.86 * Hour_15 + -150.04 * Hour_6 + -163.81 * Rainfall_log + -197.33 * Hour_2 + -198.30 * Holiday + -202.96 * Hour_14 + -204.00 * Hour_12 + -210.91 * Hour_13 + -219.35 * Hour_10 + -234.45 * Hour_11 + -240.48 * Hour_3 + -344.33 * Hour_5 + -353.58 * Hour_4
Hour 18 (6:00 PM) has the largest positive slope at 793.36, followed by hour 19 (7:00 PM) and hour 8 (8:00 AM), both just over 500. The bike rental rate at these times is significantly higher than at other times of day, while midday (between 10 AM and 3 PM) and early morning (between 1:00 AM and 6:00 AM) have much lower demand. Weather also has a strong influence: the main-effect coefficients suggest roughly 60 additional rentals per additional degree Celsius and roughly 290 more per MJ/m² of solar radiation, although with the interaction terms included, the actual marginal effects depend on the other weather conditions.
For the rain-related interactions, Rainfall_log has a strongly negative main effect (about -164 rentals), confirming that rain suppresses demand, while the large positive Solar_Radiation:Rainfall_log coefficient (about 705) suggests rain during bright daylight hours suppresses demand far less than rain at night. The negative Temperature:Precipitation coefficient (about -16) indicates that precipitation cuts demand more on warm days, when baseline ridership is higher.
How has the model improved?¶
Model Assessment¶
# R2
print("R^2:",res_final.rsquared)
# Adjusted R2
print("Adjusted R^2:",res_final.rsquared_adj)
# F-test
print("F-statistic:",res_final.fvalue)
print("F-test p-value:",res_final.f_pvalue)
# RMSE
rmse = np.sqrt(np.sum(res_final.resid**2) / res_final.df_resid) # divide by residual degrees of freedom (n - p)
print("RMSE:",rmse)
# MAE
mae = np.mean(np.abs(res_final.resid))
print("MAE:",mae)
R^2: 0.6945835908648202 Adjusted R^2: 0.6916848486941107 F-statistic: 239.61551250859017 F-test p-value: 0.0 RMSE: 357.0564029473311 MAE: 274.2116936290167
With a relatively high R-squared of 0.695, we are able to take some insights from our model, as about 69.5% of the variability in Rented Bike Count is explained by the model. This still leaves room for improvement, but it allows us to make business decisions.
The RMSE and MAE are also reasonably low relative to the scale of the hourly rental counts, showing that we can make predictions with moderate accuracy.
Finally, the F-statistic of 239.6, with a p-value far below 0.05, indicates that we can reject the null hypothesis that the model as a whole is not significant.
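For reference, both error metrics reduce to simple residual arithmetic. The sketch below uses a simulated residual vector (a stand-in, not the actual `res_final.resid`) and divides the squared-residual sum by the residual degrees of freedom n − p; the n − 2 shortcut is only exact for simple regression with a single predictor:

```python
import numpy as np

rng = np.random.default_rng(7)
resid = rng.normal(scale=350.0, size=5000)  # simulated stand-in residuals
n, p = 5000, 48                             # observations and estimated parameters (incl. intercept)

rmse = np.sqrt(np.sum(resid**2) / (n - p))  # root mean squared error on residual df
mae = np.mean(np.abs(resid))                # mean absolute error
```

RMSE penalizes large residuals more heavily than MAE, so for roughly normal residuals MAE comes out noticeably smaller, as it does in the model output above.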
Statistical Inference¶
Hypothesis Tests¶
p_values = res_final.pvalues
# Print the results with a clear indication of significance
alpha = 0.05 # Significance level
print("Hypothesis Tests for Slopes:")
for feature, p_value in p_values[1:].items(): # Exclude the constant term
if p_value < alpha:
print(f"{feature}: p-value = {p_value:.3f} (Significant)")
else:
print(f"{feature}: p-value = {p_value:.3f} (Not Significant)")
Hypothesis Tests for Slopes:
Temperature: p-value = 0.000 (Significant)
Humidity: p-value = 0.343 (Not Significant)
Solar_Radiation: p-value = 0.000 (Significant)
Holiday: p-value = 0.000 (Significant)
Precipitation: p-value = 0.733 (Not Significant)
Day_Friday: p-value = 0.000 (Significant)
Day_Monday: p-value = 0.000 (Significant)
Day_Saturday: p-value = 0.012 (Significant)
Day_Thursday: p-value = 0.000 (Significant)
Day_Tuesday: p-value = 0.000 (Significant)
Day_Wednesday: p-value = 0.000 (Significant)
Hour_1: p-value = 0.011 (Significant)
Hour_2: p-value = 0.000 (Significant)
Hour_3: p-value = 0.000 (Significant)
Hour_4: p-value = 0.000 (Significant)
Hour_5: p-value = 0.000 (Significant)
Hour_6: p-value = 0.000 (Significant)
Hour_7: p-value = 0.000 (Significant)
Hour_8: p-value = 0.000 (Significant)
Hour_9: p-value = 0.394 (Not Significant)
Hour_10: p-value = 0.000 (Significant)
Hour_11: p-value = 0.000 (Significant)
Hour_12: p-value = 0.000 (Significant)
Hour_13: p-value = 0.000 (Significant)
Hour_14: p-value = 0.000 (Significant)
Hour_15: p-value = 0.003 (Significant)
Hour_16: p-value = 0.471 (Not Significant)
Hour_17: p-value = 0.000 (Significant)
Hour_18: p-value = 0.000 (Significant)
Hour_19: p-value = 0.000 (Significant)
Hour_20: p-value = 0.000 (Significant)
Hour_21: p-value = 0.000 (Significant)
Hour_22: p-value = 0.000 (Significant)
Hour_23: p-value = 0.002 (Significant)
Snowfall_log: p-value = 0.890 (Not Significant)
Rainfall_log: p-value = 0.004 (Significant)
Temperature:Solar_Radiation: p-value = 0.000 (Significant)
Temperature:Precipitation: p-value = 0.000 (Significant)
Temperature:Humidity: p-value = 0.000 (Significant)
Humidity:Precipitation: p-value = 0.152 (Not Significant)
Temperature:Rainfall_log: p-value = 0.454 (Not Significant)
Temperature:Snowfall_log: p-value = 0.349 (Not Significant)
Solar_Radiation:Precipitation: p-value = 0.878 (Not Significant)
Solar_Radiation:Rainfall_log: p-value = 0.000 (Significant)
Solar_Radiation:Holiday: p-value = 0.045 (Significant)
Holiday:Snowfall_log: p-value = 0.029 (Significant)
Temperature:Holiday: p-value = 0.315 (Not Significant)
It is not concerning that Hour_9 and Hour_16 are insignificant, since the remaining levels of each dummy-coded categorical variable are significant; the p-value of 0.890 for Snowfall_log, however, is concerning. While snowfall seems like it should be an important indicator for our analysis, we were not able to overcome the fact that every observation containing snowfall represented an extreme case. Humidity and Precipitation were also not identified as significant here, although they were in the model without interactions; this is worth watching as we move forward with the model. Temperature and Solar Radiation are significant predictors of bike demand, and their impact is also shaped by certain interaction terms.
Many hour indicators are significant, highlighting specific peak times for rentals, while certain days (e.g., Fridays and Saturdays) also show a clear effect. Holidays alone are significant, but their interaction with certain weather factors (like Solar Radiation) matters as well.
Slope Confidence Intervals¶
confidence_intervals = res_final.conf_int(alpha=0.05)  # 95% confidence interval
print("Confidence Intervals for Slopes:")
for feature, (lower, upper) in confidence_intervals[1:].iterrows():  # exclude the constant term
    print(f"{feature}: [{lower:.3f}, {upper:.3f}]")
Confidence Intervals for Slopes:
Temperature: [56.154, 63.584]
Humidity: [-0.512, 1.471]
Solar_Radiation: [254.019, 331.655]
Holiday: [-266.786, -129.822]
Precipitation: [-195.364, 277.815]
Day_Friday: [127.622, 200.870]
Day_Monday: [47.792, 121.367]
Day_Saturday: [10.695, 85.443]
Day_Thursday: [86.066, 160.839]
Day_Tuesday: [58.383, 134.288]
Day_Wednesday: [104.849, 180.343]
Hour_1: [-161.131, -20.913]
Hour_2: [-266.870, -127.798]
Hour_3: [-312.248, -168.722]
Hour_4: [-425.238, -281.925]
Hour_5: [-415.912, -272.754]
Hour_6: [-221.118, -78.956]
Hour_7: [87.602, 231.789]
Hour_8: [432.083, 576.592]
Hour_9: [-41.281, 104.798]
Hour_10: [-298.038, -140.655]
Hour_11: [-315.059, -153.839]
Hour_12: [-287.602, -120.395]
Hour_13: [-293.466, -128.349]
Hour_14: [-284.785, -121.138]
Hour_15: [-196.332, -39.382]
Hour_16: [-47.939, 103.656]
Hour_17: [252.380, 399.910]
Hour_18: [722.451, 864.279]
Hour_19: [438.964, 580.791]
Hour_20: [390.981, 532.619]
Hour_21: [389.712, 530.096]
Hour_22: [315.110, 459.414]
Hour_23: [42.488, 183.093]
Snowfall_log: [-86.180, 74.784]
Rainfall_log: [-276.162, -51.456]
Temperature:Solar_Radiation: [-13.665, -11.097]
Temperature:Precipitation: [-20.865, -10.233]
Temperature:Humidity: [-0.504, -0.384]
Humidity:Precipitation: [-5.235, 0.814]
Temperature:Rainfall_log: [-3.716, 8.308]
Temperature:Snowfall_log: [-24.714, 8.725]
Solar_Radiation:Precipitation: [-127.644, 149.240]
Solar_Radiation:Rainfall_log: [416.342, 993.699]
Solar_Radiation:Holiday: [1.363, 120.303]
Holiday:Snowfall_log: [37.170, 679.034]
Temperature:Holiday: [-2.220, 6.883]
Main Effects
- Temperature: The effect of temperature on bike demand is positive, meaning higher temperatures are associated with increased rentals.
- Solar Radiation: Similarly, solar radiation has a large positive impact on bike rentals, consistent with people preferring to bike in sunny weather.
- Holiday: The negative interval suggests lower demand on holidays, potentially because people have alternative recreational options.
- Day Indicators (e.g., Friday, Monday, etc.): Positive intervals for most days imply higher bike rentals, particularly on certain days of the week.
- Hour Indicators: There is a mix of positive and negative intervals. For example, Hour_8 peaks with morning commutes and Hour_18 peaks with post-work evening commutes, while early morning hours like Hour_3 and Hour_4 have negative intervals, indicating lower demand.
Interaction Terms
- Temperature:Solar_Radiation: This negative interaction suggests that as both temperature and solar radiation increase, the marginal effect of temperature on bike rentals decreases slightly, perhaps because extreme heat reduces biking's appeal.
- Solar_Radiation:Rainfall_log: This positive interaction suggests that on days with both high solar radiation and rainfall, demand remains relatively high, possibly because light rain does not deter bikers as long as it is sunny.
- Holiday:Snowfall_log: Holidays with snow see a complex effect; while snow might decrease outdoor activity, holidays likely counter this effect somewhat.
- Temperature:Rainfall_log and Solar_Radiation:Precipitation: Both of these intervals contain zero, indicating that these combined effects are not significant; sunny, rainy days do not show a unique effect here. This is interesting given that each interaction was significant when compared against the model without interactions, but alongside many other interactions it is not.
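To make the Temperature:Solar_Radiation interaction concrete: with the interaction in the model, the slope of temperature is b_Temperature + b_Temperature:Solar_Radiation × Solar_Radiation. The sketch below plugs in the midpoints of the confidence intervals reported above as rough point estimates (an approximation; these are not the fitted coefficients themselves, though Wald intervals are symmetric about the OLS estimate):

```python
# Midpoints of the reported 95% CIs, used as approximate point estimates
b_temp = (56.154 + 63.584) / 2          # ~59.9 bikes per degree C at zero solar radiation
b_temp_solar = (-13.665 + -11.097) / 2  # ~-12.4 bikes per degree C per MJ/m^2

# Marginal effect of temperature at several solar radiation levels
for solar in [0.0, 1.0, 2.0, 3.0]:
    slope = b_temp + b_temp_solar * solar
    print(f"Solar_Radiation = {solar:.1f} MJ/m^2 -> temperature slope ~ {slope:.1f} bikes per degree C")
```

The temperature slope stays positive over this range but shrinks as solar radiation rises, which matches the interpretation above that intense sun dampens the benefit of warmer weather.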
Mean of Y Confidence Intervals¶
We tried to pick values for a random day in October. The day is a Friday at noon, not a holiday, no precipitation.
# Create a function to generate an observation with interactions
def create_observation(temp=np.mean(bike_df["Temperature"]),
                       humidity=np.mean(bike_df["Humidity"]),
                       solar_rad=np.mean(bike_df["Solar_Radiation"]),
                       holiday=0, precip=0, snow=0, rain=0,
                       day="Wednesday", hour=12,
                       exog_names=res_final.model.exog_names):
    # Initialize base observation with main effect variables
    observation = {
        'const': [1],
        'Temperature': [temp],
        'Humidity': [humidity],
        'Solar_Radiation': [solar_rad],
        'Holiday': [holiday],
        'Precipitation': [precip],
        'Snowfall_log': [np.log(snow + 1)],  # Log transform for snowfall
        'Rainfall_log': [np.log(rain + 1)]   # Log transform for rainfall, matching the Rainfall_log feature
    }
    # Add day-of-week columns, setting 1 for the selected day, 0 for the others
    days_of_week = ['Day_Monday', 'Day_Tuesday', 'Day_Wednesday',
                    'Day_Thursday', 'Day_Friday', 'Day_Saturday']
    for d in days_of_week:
        observation[d] = [1 if d == f"Day_{day}" else 0]
    # Add hour columns, setting 1 for the selected hour, 0 for the others
    for h in range(1, 24):
        observation[f'Hour_{h}'] = [1 if h == hour else 0]
    # Convert the dictionary to a DataFrame
    observation_df = pd.DataFrame(observation)
    # Add interaction terms dynamically based on exog_names
    for name in exog_names:
        if ':' in name:  # identifies interaction terms
            # Split the interaction term into its components
            var1, var2 = name.split(':')
            # Ensure both variables exist in the observation
            if var1 in observation_df.columns and var2 in observation_df.columns:
                # Calculate the interaction term and add it to the DataFrame
                observation_df[name] = observation_df[var1] * observation_df[var2]
    return observation_df
# Generate the observation with interactions
new_observation = create_observation(
    temp=16.8,
    humidity=28,
    solar_rad=2.1,
    day='Friday',
    hour=12
)
res_final.get_prediction(new_observation).summary_frame(alpha=0.05).loc[:, ['mean_ci_lower', 'mean_ci_upper']]
|   | mean_ci_lower | mean_ci_upper |
|---|---|---|
| 0 | 752.010396 | 934.090307 |
For a random day in October (with the above weather conditions), we are 95% confident that the average rented bike count will be between 752.0 and 934.1.
Prediction Intervals¶
For the same observation as above.
res_final.get_prediction(new_observation).summary_frame(alpha = 0.05).loc[:, ['obs_ci_lower', 'obs_ci_upper']]
|   | obs_ci_lower | obs_ci_upper |
|---|---|---|
| 0 | 133.94943 | 1552.151273 |
For the same random day in October, we are 95% confident that the actual rented bike count will fall between about 134 and 1552 bikes.
Conclusion¶
With relative certainty in our model, we can predict the demand for bikes on a given day based on the day of the week, time, and weather conditions. The model has large positive slopes at peak commuting times, specifically 8:00 AM and 6:00 PM. Meanwhile, there is a large negative slope for precipitation, indicating that fewer people want to travel in poor weather conditions, which is further supported by the positive slopes for solar radiation and temperature.
To avoid removing bikes at peak demand times when performing maintenance, we propose Tuesdays between 10:00 AM and 3:00 PM. Demand is low in this window and it falls during normal working hours, making it a practical choice, although 1:00 AM to 6:00 AM, or Saturdays, are technically lower-demand periods. If provided with the cost structure for the bikes, we could weigh the cost and benefit of paying workers extra to perform weekend maintenance instead, keeping bikes in circulation on weekdays when demand is higher.
A dynamic pricing model would also be practical to implement. It could be based on weather forecasts, time of day, or both, to capture additional profit during high-demand periods while encouraging rentals during lower-demand ones, especially on rainy days or holidays, which have the most significant negative effects on demand. However, without a pricing structure in the dataset, it is difficult to specify an exact model.
It would also be beneficial to research the demand at each individual station and how bikes are returned to those stations. This could show which stations have high weekday demand from business commuters, versus stations where tourists drive higher demand on weekends or at midday. Analyzing how often bikes return to their original station, or to any station at all, could also surface other jobs to complete at low-demand times, such as relocating bikes to high-demand stations and returning stranded bikes to accessible locations.
Overall, our model is insightful in showing the demand for bikes based on the time of day and weather conditions. While further research and data could make the model more robust and broaden its applications, in its current form it is appropriate for predicting demand and answering business questions beyond our current findings.